14 research outputs found

    The Effectiveness of Decoupling

    Multiple-banked Register File Architectures

    The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides lo

    Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

    Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives. Comment: 8 pages, ICRA 2015 conference paper

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ~99% of the euchromatic genome and is accurate to an error rate of ~1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead

    Simplifying Hardware for Out Of Order Execution using the Decoupling Paradigm

    Future hardware and software technology will try to provide improved performance by extracting higher levels of parallelism. However the cost of a main memory access - in terms of missed instruction issue slots - increases with faster processors and greater issue widths. For this reason latency hiding technology remains one of the most important parts of high performance processor designs. In this paper we investigate the behaviour of data prefetching on an access decoupled machine and a superscalar machine. Access decoupling is a latency hiding technique that partitions a program into two separate instruction streams to aggressively prefetch data. Superscalar architectures can support data prefetching through out-of-order execution, non-blocking loads and lock-up free caches. In this paper we investigate if there are benefits to using the decoupling paradigm given that an out-of-order (o-o-o) superscalar architecture could in principle prefetch to the same degree as an access decoup..

    Design Issues for Latency Hiding on an Access Decoupled Machine

    Future software and hardware technologies will try to provide improved performance by extracting higher levels of parallelism. However the cost of a main memory access - in terms of missed instruction slots - increases with faster processors and greater issue widths. For this reason latency hiding technology remains one of the most important parts of high performance processor designs. In this paper we investigate a latency hiding technique known as Access Decoupling which partitions a program into two separate instruction streams in order to aggressively prefetch data. We justify a renewed interest in Access Decoupling in two ways. Firstly as a latency hiding technique and secondly as a solution to the problem of hardware complexity in large issue width, out-of-order superscalar architectures. We show that in comparison to a single instruction stream architecture Access Decoupling is marginally more effective at hiding memory latency and capable of achieving higher performance throu..